process of mapping reads to a reference genome producing a SAM/BAM file that contains
the mapping information. Refer to Chapter 2 for the read mapping and the content of
SAM/BAM files. When dealing with RNA-Seq data, we can align the reads either to a
reference genome or to a reference transcriptome. When we align RNA-Seq reads to a eukaryotic
reference genome, we must use an aligner such as STAR that is able to detect
splice junctions. The reads in this case will map to the exons, leaving introns and other
non-coding regions of the genome largely uncovered. On the other hand, when aligning RNA-Seq
reads to a reference transcriptome, the aligned reads may cover the entire sequence. This
strategy is preferable when reads are very short (less than 50 bases). The downside of align-
ing reads to a transcriptome is that we may miss some novel genes since the transcriptome
is made up of only known transcripts. As discussed in Chapter 2, there are several aligners;
however, for RNA-Seq, we prefer to use a splice-aware aligner that is able to introduce
long gaps to span introns when aligning reads to a reference genome. Aligners commonly used
with RNA-Seq data include STAR [5], segemehl [6], GEM [7], BWA [8], BWA-MEM
[8], and BBMap [9]. Note that BWA and BWA-MEM are not splice-aware, so they are better
suited to aligning reads to a reference transcriptome than to a reference genome.
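To make the idea of a long gap spanning an intron concrete, the following minimal Python sketch decodes the CIGAR string of a SAM record. In the SAM format, the N operation marks a skipped region of the reference (an intron in a spliced alignment), whereas I and D mark short insertions and deletions. The function name aligned_blocks and the example coordinates are hypothetical and are used here only for illustration.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_blocks(pos, cigar):
    """Split an alignment into exon blocks and introns.

    pos is the 1-based leftmost reference position of the alignment.
    Returns (blocks, introns), each a list of 1-based inclusive
    (start, end) intervals on the reference.
    """
    blocks, introns = [], []
    ref = pos            # next reference position to be consumed
    block_start = None   # start of the exon block currently being built
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=XD":             # these operations consume the reference
            if block_start is None:
                block_start = ref
            ref += length
        elif op == "N":              # skipped region: close the block, record an intron
            if block_start is not None:
                blocks.append((block_start, ref - 1))
                block_start = None
            introns.append((ref, ref + length - 1))
            ref += length
        # I, S, H and P do not consume the reference, so ref is unchanged
    if block_start is not None:
        blocks.append((block_start, ref - 1))
    return blocks, introns


# A 100-base read split across two exons separated by a 1000-base intron.
print(aligned_blocks(101, "50M1000N50M"))
# A read with a 2-base insertion: one block and no intron.
print(aligned_blocks(1, "30M2I20M"))
```

A spliced read therefore produces two or more blocks separated by introns, while a read with a short insertion stays in a single block.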
Before deciding which aligner to use with RNA-Seq reads, make sure that the
aligner is splice-aware and able to distinguish between reads aligned across exon–intron
boundaries and reads with short insertions [10]. Splice-aware aligners include STAR
[5], GSNAP [11], MapSplice [12], RUM [13], and HISAT2 [14]. Each of these aligners has
different advantages and disadvantages in terms of memory efficiency, performance, and
speed. Refer to the user guide of any of these aligners to learn more about them. We will
use STAR (Spliced Transcripts Alignment to a Reference) as an example aligner for align-
ing RNA-Seq data. Several studies found that STAR is one of the most accurate aligners of
RNA-Seq reads [15]. However, STAR requires a large amount of memory for indexing and mapping.
The reference sequence must be indexed by STAR before alignment. STAR begins the mapping
process with a seed search: for each read, it finds the longest prefix of the read that
exactly matches one or more locations on the reference sequence. If only part of the read
is aligned in this way, STAR repeats the search on the unmapped portion, which may align to
a different region of the reference, for example the next exon. The parts of a read that
align separately to the reference sequence are called seeds. If STAR does not find an exact
match for a portion of a read because of mismatches or indels, the seed is extended, allowing
mismatches and gaps. If the extension does not give a good alignment, the poorly aligned
portion of the read is soft-clipped. In the second step of the STAR alignment
process, the seeds are clustered based on proximity to a set of anchor seeds. The
clustered seeds are stitched together based on the best alignment score [5].
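The seed search step can be illustrated with a toy Python sketch. It repeatedly finds the longest prefix of the unmapped part of a read that matches the reference exactly. This is a naive substring search written only to show the idea; it is not STAR's implementation, which searches an index of the genome and also handles mismatches, extension, clustering, and scoring. The function names, the min_seed threshold, and the sequences are hypothetical.

```python
def maximal_mappable_prefix(read, reference, start=0):
    """Longest prefix of read[start:] that occurs exactly in reference.

    Returns (length, positions), where positions are the 0-based
    reference offsets at which that prefix occurs.
    """
    length = 0
    # If a prefix of length k+1 occurs, the prefix of length k occurs too,
    # so the prefix can be grown one base at a time.
    while start + length < len(read) and read[start:start + length + 1] in reference:
        length += 1
    if length == 0:
        return 0, []
    prefix = read[start:start + length]
    positions = [i for i in range(len(reference) - length + 1)
                 if reference.startswith(prefix, i)]
    return length, positions


def seed_search(read, reference, min_seed=5):
    """Split a read into exactly matching seeds: (read_offset, length, positions)."""
    seeds = []
    offset = 0
    while offset < len(read):
        length, positions = maximal_mappable_prefix(read, reference, offset)
        if length < min_seed:
            offset += 1      # toy handling: skip a base that cannot start a seed
            continue
        seeds.append((offset, length, positions))
        offset += length     # continue with the still unmapped part of the read
    return seeds


# Toy reference: two exons separated by an intron.
exon1, intron, exon2 = "ACGTACGGTCA", "GTTTTTTTTTAG", "CCATGGATCCA"
reference = exon1 + intron + exon2
# A read that spans the exon1-exon2 splice junction.
read = exon1[-8:] + exon2[:8]
print(seed_search(read, reference))
```

In this example the read yields two seeds, one ending at the end of the first exon and one starting at the beginning of the second exon, which is the pattern a splice-aware aligner stitches into a single spliced alignment.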
When reads are mapped to a reference sequence, the percentage of mapped reads is an indicator of
the quality of the alignment. A low percentage may indicate sample contamination, poor read
quality, or an unsuitable reference sequence. Read
coverage and depth on exons are other factors used to assess alignment quality.
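As a rough illustration, the percentage of mapped reads can be computed from the FLAG field of the SAM records. In the SAM format, bit 0x4 marks an unmapped read, and bits 0x100 and 0x800 mark secondary and supplementary alignments, which are skipped here so that each read is counted once. The sketch below assumes uncompressed SAM text; the function name mapping_rate and the example records are hypothetical. For paired-end data, each mate is counted as a separate record.

```python
def mapping_rate(sam_lines):
    """Percentage of primary SAM records that are mapped."""
    total = mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue                        # skip blank lines and header lines
        flag = int(line.split("\t")[1])     # FLAG is the second SAM column
        if flag & 0x100 or flag & 0x800:
            continue                        # skip secondary and supplementary records
        total += 1
        if not flag & 0x4:                  # bit 0x4 set means the read is unmapped
            mapped += 1
    return 100.0 * mapped / total if total else 0.0


# Minimal made-up records: only the first columns matter for this calculation.
example = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t255\t50M",
    "r2\t16\tchr1\t200\t255\t50M",
    "r2\t256\tchr2\t300\t0\t50M",   # secondary alignment of r2, not counted
    "r3\t4\t*\t0\t0\t*",
    "r4\t4\t*\t0\t0\t*",
]
print(mapping_rate(example))
```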
Above, we downloaded the FASTQ files into the directory “fastq”. We can map the reads
in the FASTQ files to a reference genome using the STAR program. STAR is a short-read aligner
designed to align RNA-Seq reads to a reference sequence (genome or transcriptome). To
align the reads in the FASTQ files using STAR, we need to download a reference sequence
together with its annotation file in GTF format. Since our example FASTQ files are from
human samples, we need to download the latest human genome and its annotation file.
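Once the reference genome and the GTF file are available, the genome must be indexed before any reads are mapped. The following Python sketch only assembles the STAR command line for that step; it does not run STAR. The options shown (--runMode genomeGenerate, --genomeDir, --genomeFastaFiles, --sjdbGTFfile, --sjdbOverhang, --runThreadN) are standard STAR options, but the function name, the file paths, the read length, and the thread count are assumptions made for illustration and should be replaced with your own values.

```python
import shlex


def star_index_command(genome_dir, fasta, gtf, read_length=100, threads=4):
    """Build (but do not run) a STAR genome indexing command as an argument list."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,                # output directory for the index
        "--genomeFastaFiles", fasta,              # reference genome in FASTA format
        "--sjdbGTFfile", gtf,                     # gene annotation in GTF format
        "--sjdbOverhang", str(read_length - 1),   # the STAR manual suggests read length minus 1
    ]


# Hypothetical paths, shown only to illustrate the shape of the command.
cmd = star_index_command("ref/star_index", "ref/genome.fa", "ref/annotation.gtf")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where STAR is installed and enough memory is available.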